I am going to focus on the Republican Primary in Iowa, plot some maps and graphs and see if I can figure out any patterns in the ways Iowans voted.

1. Packages and Data Munging:

library(tigris)
library(leaflet)
library(dplyr)
library(acs)
library(ggplot2)
library(plotly)
library(grid)
library(gridExtra)

primary <- read.csv("PrimaryResults.csv", stringsAsFactors = FALSE)

str(primary)
## 'data.frame':    1585 obs. of  7 variables:
##  $ State            : chr  "Iowa" "Iowa" "Iowa" "Iowa" ...
##  $ StateAbbreviation: chr  "IA" "IA" "IA" "IA" ...
##  $ County           : chr  "Adair" "Adams" "Allamakee" "Appanoose" ...
##  $ Party            : chr  "Democrat" "Democrat" "Democrat" "Democrat" ...
##  $ Candidate        : chr  "Hillary Clinton" "Hillary Clinton" "Hillary Clinton" "Hillary Clinton" ...
##  $ Votes            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ FractionVotes    : num  0.56 0.57 0.51 0.55 0.63 0.56 0.53 0.55 0.53 0.53 ...

Basically we have the fraction of votes for each candidate in every county in Iowa and New Hampshire. Next I will filter the original dataframe to get the Iowa Republican results. Also I am mostly interested in the winners by county so I will use the dplyr package to create a new data frame that shows more clearly who won in each county and what percentage of votes they took.

state = "Iowa"
party = "Republican"

Votes <- primary %>% 
            filter( State == state, Party == party) %>% 
            group_by(County) %>% 
            summarize(Candidate = Candidate[which.max(FractionVotes)],
                                            Vote = max(FractionVotes))
head(Votes)
## Source: local data frame [6 x 3]
## 
##      County    Candidate      Vote
##       (chr)        (chr)     (dbl)
## 1     Adair Donald Trump 0.2567901
## 2     Adams     Ted Cruz 0.2977941
## 3 Allamakee Donald Trump 0.2813411
## 4 Appanoose Donald Trump 0.3480334
## 5   Audubon     Ted Cruz 0.3609626
## 6    Benton     Ted Cruz 0.3647491

2. Demographics of Iowa

To make more sense of the results we need to know more about the demographics for Iowa. Based on the few articles I read about the Republican primary race I’d be curious to see how the following demographic factors come into play:

Unfortunately it’s somewhat hard to find a systematic way of getting this data. The closer thing I could find is the acs package - altough I couldn’t find tables related to everything I was interested in. I thus instead opted to just get the demographics from the [State Data Center of Iowa] (http://www.iowadatacenter.org/browse/counties.html). Unfortunately they only give .xls files so I just copied and pasted the data I was interested in and then saved it as a csv file - hopefully I didn’t mess anything up. You can find the file here. Now all we have to do is join that table with our original table.

iowa.demographics <- read.csv("https://raw.githubusercontent.com/apapiu/Rplots/master/iowa_counties.csv")

Votes <- inner_join(Votes, iowa.demographics, by = "County")
## Warning in inner_join_impl(x, y, by$x, by$y): joining character vector and
## factor, coercing into character vector
names(Votes)[5:9] <- c("Population", "Age", "Income","College", "Urban") 
str(Votes)
## Classes 'tbl_df', 'tbl' and 'data.frame':    99 obs. of  9 variables:
##  $ County    : chr  "Adair" "Adams" "Allamakee" "Appanoose" ...
##  $ Candidate : chr  "Donald Trump" "Ted Cruz" "Donald Trump" "Donald Trump" ...
##  $ Vote      : num  0.257 0.298 0.281 0.348 0.361 ...
##  $ X         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Population: int  8243 4482 14675 13721 6830 25308 128012 26224 23325 21093 ...
##  $ Age       : num  41.8 41.9 39.7 40.6 42.4 37.2 34.4 38.6 38.1 36.4 ...
##  $ Income    : int  39803 33425 37168 32113 34805 46382 46087 45629 47242 42207 ...
##  $ College   : num  11.2 12 14.4 12.2 12.3 13.9 23 18.8 21.5 12.7 ...
##  $ Urban     : int  0 0 28 41 0 30 85 48 33 28 ...

Great, now we have a bunch of new variables that we can use:

3.Maps, Plots and Exploration

Let’s first map out the winners. We will do so using two asweome packages: tigris and leafet. The basic workflow is like this: tigris provides the geospatial county data that we will overlay over a leaflet map and fill in the colors based on the winning candidate. I choose tigris instead of the given kaggle spatial data to make this report more reproducible. As you will see using tigris is very easy!

Counties <- counties(state = state) #gives the geospatial files

Next we join the sptial data Counties with the Votes data. We have to make sure we choose the right columns to join - in our case we’re joining by county. Sometimes this will require a bit of data cleaning but in our case we’re good.

Merge <- geo_join(Counties, Votes, "NAME", "County")

And finally we use leaflet to plot the map:

pal = colorFactor(c("dark red","dark green", "blue"), domain = Merge$Candidate)
popup = paste0(Merge$County, " County")

leaflet() %>% 
    addProviderTiles("CartoDB.Positron") %>% 
    addPolygons(data = Merge, weight = 1, color = "white",
                smoothFactor = 0.2, popup = popup,
                fillColor = ~pal(Merge$Candidate), fillOpacity = .7) %>% 
    addLegend(pal = pal, value = Merge$Candidate,
              title = "Winner by County")

We have our first map! Unfortunately without knowing more about the demographics of Iowa we can’t really say too much. It does seem that Rubio won the more densely populated, urban areas while Trump and Cruz split the remaining counties. Now let’s see if we can use the demographic data to gain any insight. Let’s plot some maps based on the demographic variables:

Let’s also look the average demographics of candidates based on who won:

Avg <- Votes %>% 
        group_by(Candidate) %>%
        summarize(mean(Income), mean(Urban), mean (College), mean(Age)) 
Avg
## Source: local data frame [3 x 5]
## 
##      Candidate mean(Income) mean(Urban) mean(College) mean(Age)
##          (chr)        (dbl)       (dbl)         (dbl)     (dbl)
## 1 Donald Trump     40733.11    37.86842      14.72895  39.41316
## 2  Marco Rubio     56513.80    76.80000      34.70000  31.96000
## 3     Ted Cruz     41100.61    33.80357      15.80536  39.28750

The table above give us an idea of what the “average” county looks like in terms of the winner. Younger, less educated people tend to vote for Trump or Cruz while older, more educated voters tend to choose . Trump and Cruz seem pretty hard to differentiate - the only significant difference seems to be that the counties where Cruz wins are more rural. Let’s actually plot the winning candidates based on the four demographics variables we have - we’re going to do them 2 by 2. Note that the size of the balls below are proportional to the fraction of votes the winning candidate received.

The graphs above are quite striking - the first one shows a signifcant gap in income and education between Rubio on one hand and Trump and Cruz on the other hand - in fact, the data is almost linearly separable. From the second graph we can see some trends in the second plot with Rubio winning younger, more urbanized states - we should however be weary of any forceful statements however since we don’t have that many data points.

4. Trying out some models

4.1 A closer look at Rubio

Based on the graphs above is seems like Rubio’s appeal lies within urban, wealthy, well-educated areas. To get a better understanding of this however we’ll go back to the actual percentages of votes Rubio got in each County and see how much the number of votes is influenced by our 4 demographics factors: income, percentage of people with a college degree, and percentage of people living in urban areas and median age. Let’s take a look:

There seems to be a postitive correlation between the percentage of votes Rubio gets and how wealthy and urbanized a specific county is. It’s a bit unclear if college plays a role based on the graph above. Let’s take it one step further and build a prediction model of Rubio. Now I’m no Nate Silver so we’ll just good old linear regression with the four demographic features that we have.

summary(lm(FractionVotes ~ Urban + Income + College + Age, data = rubio))
## 
## Call:
## lm(formula = FractionVotes ~ Urban + Income + College + Age, 
##     data = rubio)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.5125  -2.4244  -0.3343   2.2943  12.2393 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -6.73220   12.48068  -0.539  0.59088   
## Urban        0.01375    0.01970   0.698  0.48681   
## Income       0.37268    0.12340   3.020  0.00325 **
## College      0.26178    0.11173   2.343  0.02124 * 
## Age          0.13479    0.23431   0.575  0.56648   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.026 on 94 degrees of freedom
## Multiple R-squared:  0.4182, Adjusted R-squared:  0.3934 
## F-statistic: 16.89 on 4 and 94 DF,  p-value: 1.824e-10

Let’s analyze the results a bit. The p-values we get suggest that there is a relation between the predictor variables of college and income and the percentage of votes Rubio gets. For every $1000 increase in the median household income, Rubio would get .37% added to his fraction of votes.

4.2 A Model of Cruz

Let’s do the same analysis for Cruz as well.

Ted Cruz’s voting trends are in a lot of ways opposite to those of Rubio - he seem to fare better in poorer, less urbanized and less educated counties. The linear regression below show the median age, levels or urbanization and education levels to be somewhat related to the response. Notice however the p-values.

summary(lm(FractionVotes ~ Urban + Income + College + Age, data = rubio))
## 
## Call:
## lm(formula = FractionVotes ~ Urban + Income + College + Age, 
##     data = rubio)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.1955  -2.5278  -0.2405   3.5180  14.9628 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 66.14908   14.86170   4.451 2.35e-05 ***
## Urban       -0.04329    0.02346  -1.845   0.0681 .  
## Income      -0.22024    0.14694  -1.499   0.1373    
## College     -0.22934    0.13304  -1.724   0.0880 .  
## Age         -0.55668    0.27901  -1.995   0.0489 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.794 on 94 degrees of freedom
## Multiple R-squared:  0.1874, Adjusted R-squared:  0.1528 
## F-statistic:  5.42 on 4 and 94 DF,  p-value: 0.0005693

4.3 How about Trump?

Trump’s harder to pin down. It seems he is more successful in poorer, less educated counties - however the linear regresion below tells us that only the college variable is significantly related to the fraction of votes Trump receives. Roughly speaking he loses .33% for every 1% increase in the proportion of the county population with college degrees - notice again the large standard errors however.

summary(lm(FractionVotes ~ Urban + Income + College + Age, data = rubio))
## 
## Call:
## lm(formula = FractionVotes ~ Urban + Income + College + Age, 
##     data = rubio)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.7961  -2.6019  -0.0077   2.0391  16.0343 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 26.99744   14.53021   1.858  0.06630 . 
## Urban        0.03434    0.02294   1.497  0.13766   
## Income      -0.05939    0.14366  -0.413  0.68027   
## College     -0.36105    0.13008  -2.776  0.00665 **
## Age          0.17932    0.27278   0.657  0.51255   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.687 on 94 degrees of freedom
## Multiple R-squared:  0.2075, Adjusted R-squared:  0.1738 
## F-statistic: 6.154 on 4 and 94 DF,  p-value: 0.0001922

5. Conclusions

This is clearly still a work in progress and it’s probably a little too early to make any conclusions. Ideally I’d like to get more data…the issue I think will be getting the demographic data for New Hampshire, Nevada and South Carolina and put it in a